Project: Amazon Product Recommendation System¶
This project builds a recommendation system using user ratings data from the Amazon Electronics category. We will preprocess the data, explore its characteristics, and implement a collaborative filtering approach using matrix factorization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sparse
from scipy.sparse.linalg import svds
import warnings
warnings.filterwarnings('ignore')
Load and Preview Dataset¶
We load the ratings_Electronics.csv file, which contains user-item ratings for Amazon electronics products. This step helps us inspect the structure and contents of the data before proceeding with any analysis.
# Load the dataset
df = pd.read_csv('ratings_Electronics.csv', names=['user_id', 'product_id', 'rating', 'timestamp'])
# Display the first 5 rows
df.head()
| | user_id | product_id | rating | timestamp |
|---|---|---|---|---|
| 0 | AKM1MP6P0OYPR | 0132793040 | 5.0 | 1365811200 |
| 1 | A2CX7LUOHB2NDG | 0321732944 | 5.0 | 1341100800 |
| 2 | A2NWSAGRHCP8N5 | 0439886341 | 1.0 | 1367193600 |
| 3 | A2WNBOD3WNDNKT | 0439886341 | 3.0 | 1374451200 |
| 4 | A1GI0U4ZRJA8WN | 0439886341 | 1.0 | 1334707200 |
Observation¶
- The dataset consists of 4 columns: `user_id`, `product_id`, `rating`, and `timestamp`.
- Each row represents a user's rating for a specific electronic product on Amazon.
- Ratings are numerical values, and `timestamp` indicates when the rating was made.
- There are no column headers in the CSV file, so we manually assigned them during loading.
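Since the file has no header row and runs to millions of rows, it can also help to declare dtypes at load time: `category` for the repeated ID strings and `float32` for ratings can shrink the memory footprint considerably. A minimal sketch, using a tiny in-memory stand-in for `ratings_Electronics.csv` (the CSV string below is purely illustrative):

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the headerless ratings_Electronics.csv file.
csv_data = io.StringIO(
    "AKM1MP6P0OYPR,0132793040,5.0,1365811200\n"
    "A2CX7LUOHB2NDG,0321732944,5.0,1341100800\n"
)

# Declaring dtypes up front avoids storing every ID as a full Python string.
df_small = pd.read_csv(
    csv_data,
    names=['user_id', 'product_id', 'rating', 'timestamp'],
    dtype={'user_id': 'category', 'product_id': 'category',
           'rating': 'float32', 'timestamp': 'int64'},
)
print(df_small.dtypes)
```

The same `dtype` mapping passed to the `pd.read_csv` call in the cell above would apply this to the full dataset.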
Dataset Overview¶
We check the dataset's dimensions, data types, and whether there are any missing values. This helps us understand its structure and identify potential data quality issues.
# Shape of the dataset (rows, columns)
print("Dataset shape:", df.shape)
# Data types and non-null counts
print("\nData Info:")
print(df.info())
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
Dataset shape: (7824482, 4)

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7824482 entries, 0 to 7824481
Data columns (total 4 columns):
 #   Column      Dtype
---  ------      -----
 0   user_id     object
 1   product_id  object
 2   rating      float64
 3   timestamp   int64
dtypes: float64(1), int64(1), object(2)
memory usage: 238.8+ MB
None

Missing values:
user_id       0
product_id    0
rating        0
timestamp     0
dtype: int64
Observation¶
- The dataset contains 7,824,482 rows and 4 columns.
- Column data types:
  - `user_id` and `product_id` are object (string) types.
  - `rating` is a float, likely ranging from 1.0 to 5.0.
  - `timestamp` is an integer representing Unix time.
- No missing values are present in any of the columns.
- Memory usage is approximately 238.8 MB.
Rating Distribution¶
We examine how the ratings are distributed to understand user behavior. This helps us decide whether to keep all ratings or filter only higher ratings for building a more focused recommendation model.
# Distribution of rating values
rating_counts = df['rating'].value_counts().sort_index()
# Plot the rating distribution
plt.figure(figsize=(8, 5))
sns.barplot(x=rating_counts.index, y=rating_counts.values, palette='Blues_d')
plt.title('Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.grid(axis='y')
plt.show()
# Also print counts
print(rating_counts)
rating
1.0     901765
2.0     456322
3.0     633073
4.0    1485781
5.0    4347541
Name: count, dtype: int64
Observation¶
- The majority of ratings are 5.0, accounting for more than half of the dataset (≈4.3 million ratings).
- There's a clear skew toward positive ratings:
- 5.0: 4,347,541 ratings
- 4.0: 1,485,781 ratings
- Lower ratings (1.0 to 3.0) are much less frequent.
- This imbalance suggests that users tend to rate products they like, which is a common trend in recommendation system datasets.
- For building a recommendation model, we might consider filtering for ratings ≥ 4.0 to focus on strong user preferences.
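The skew described above can also be read as percentage shares using `value_counts(normalize=True)`. A small sketch on toy data (not the full dataset), with values chosen only to illustrate the call:

```python
import pandas as pd

# Toy stand-in for df['rating']; the real column has ~7.8M values.
ratings = pd.Series([5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 3.0, 1.0])

# normalize=True returns fractions instead of raw counts.
share = ratings.value_counts(normalize=True).sort_index()
print((share * 100).round(1))
```

Applied to `df['rating']`, this would confirm directly that 5.0 ratings make up more than half of the dataset.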
Why Filter for Ratings ≥ 4.0?¶
Filtering the dataset to keep only higher ratings (typically 4.0 and 5.0) is a common preprocessing step in recommendation systems. Here's why this approach is often preferred:
1. Focus on Positive Preferences¶
Most recommendation algorithms, especially collaborative filtering, are designed to identify patterns in what users like. Including only strong positive signals (ratings ≥ 4.0) helps the model learn user preferences more effectively.
- A rating of 4 or 5 generally indicates satisfaction.
- Lower ratings may reflect many unrelated factors (product defects, shipping issues, etc.) and introduce noise.
2. Mimicking Implicit Feedback¶
Many modern recommendation engines operate using implicit feedback — such as clicks, views, or purchases — rather than explicit 1–5 ratings.
- Filtering for high ratings emulates implicit positive signals.
- It helps the model interpret user behavior more like: "this user engaged positively with this item."
3. Reducing Sparsity and Noise¶
User-item interaction matrices are typically very sparse, especially with millions of products.
- Filtering out low ratings reduces the size of the matrix.
- The remaining data has a higher signal-to-noise ratio, improving both training speed and recommendation accuracy.
4. Improved Interpretability and Performance¶
- Recommendations are more interpretable when they are based on what users loved, rather than trying to balance positive and negative feedback.
- Filtering reduces computational complexity and helps matrix factorization techniques (like SVD) converge faster.
5. When Not to Filter¶
There are scenarios where retaining all ratings makes sense:
- If your goal is to predict the exact rating a user would give (regression model).
- If negative feedback is informative for your application (e.g., to avoid bad experiences).
However, for ranking top-N recommendations, filtering for higher ratings is typically beneficial.
Filter for Positive Ratings (≥ 4.0)¶
To reduce noise and focus on strong user preferences, we filter the dataset to include only ratings of 4.0 and 5.0. This step helps the recommendation system focus on what users truly liked, improving both training efficiency and output relevance.
# Filter the dataset for ratings >= 4.0
df_filtered = df[df['rating'] >= 4.0]
# Check the new shape and a preview
print("Filtered dataset shape:", df_filtered.shape)
df_filtered.head()
Filtered dataset shape: (5833322, 4)
| | user_id | product_id | rating | timestamp |
|---|---|---|---|---|
| 0 | AKM1MP6P0OYPR | 0132793040 | 5.0 | 1365811200 |
| 1 | A2CX7LUOHB2NDG | 0321732944 | 5.0 | 1341100800 |
| 5 | A1QGNMC6O1VW39 | 0511189877 | 5.0 | 1397433600 |
| 7 | A2TY0BTJOTENPG | 0511189877 | 5.0 | 1395878400 |
| 8 | A34ATBPOK6HCHY | 0511189877 | 5.0 | 1395532800 |
Observation¶
- After filtering for ratings ≥ 4.0, the dataset now contains 5,833,322 entries, down from the original 7.8 million.
- This means we've removed approximately 25% of the data (the ~1.99 million ratings of 1.0 to 3.0).
- The filtered dataset retains only strong positive interactions, which is ideal for building a recommendation system focused on user satisfaction.
- The structure of the dataset remains the same, with 4 columns: `user_id`, `product_id`, `rating`, and `timestamp`.
Analyze User and Product Activity¶
We examine how many unique users and products are in the filtered dataset. We also explore how many ratings each user has given and how many ratings each product has received. This helps us understand user engagement levels and item popularity.
# Number of unique users and products
num_users = df_filtered['user_id'].nunique()
num_products = df_filtered['product_id'].nunique()
print(f"Number of unique users: {num_users}")
print(f"Number of unique products: {num_products}")
# Count of ratings per user
user_activity = df_filtered['user_id'].value_counts()
# Count of ratings per product
product_popularity = df_filtered['product_id'].value_counts()
# Show the top 5 most active users and most rated products
print("\nTop 5 most active users:")
print(user_activity.head())
print("\nTop 5 most rated products:")
print(product_popularity.head())
Number of unique users: 3256144
Number of unique products: 410110

Top 5 most active users:
user_id
A3OXHLG6DIBRW8    464
ADLVFFE4VBT8      414
A5JLAU2ARJ0BO     358
A1ODOGXEYECQQ8    346
A6FIAB28IS79      340
Name: count, dtype: int64

Top 5 most rated products:
product_id
B0074BW614    16098
B007WTAJTO    12244
B0019EHU8G    11640
B00DR0PDNE    11604
B006GWO5WK    10048
Name: count, dtype: int64
Observation¶
- The filtered dataset contains:
- 3,256,144 unique users
- 410,110 unique products
- The most active users have submitted several hundred ratings:
- The top user has rated 464 products.
- The second and third most active users have rated 414 and 358 products respectively.
- Some products have extremely high numbers of ratings:
- The most rated product has received 16,098 ratings.
- Several others exceed 10,000 ratings.
This indicates a highly skewed distribution, where:
- A small subset of users is very active.
- A small number of products are very popular.
This kind of skew is common in recommendation datasets and is useful to know when considering filtering strategies for cold-start users or long-tail items.
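One way to quantify that skew is to ask what fraction of all ratings the most-rated items account for. A hedged sketch on toy data (the item IDs and counts here are made up for illustration):

```python
import pandas as pd

# Toy interaction log: one very popular product and a long tail.
df_toy = pd.DataFrame({'product_id': ['P1'] * 6 + ['P2'] * 2 + ['P3', 'P4']})

# value_counts() sorts descending, so head(n) gives the n most-rated items.
counts = df_toy['product_id'].value_counts()
top_share = counts.head(1).sum() / counts.sum()
print(f"Top product accounts for {top_share:.0%} of ratings")
```

Running the same computation against `product_popularity` would show how concentrated the real dataset is on its head items.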
Why Filter Low-Activity Users and Rarely-Rated Products?¶
Our dataset is rich: it has millions of users and hundreds of thousands of products, but many of those users and items appear only a few times. This can cause several issues:
Why Filter?¶
Sparsity Reduction
- Sparse user-item matrices slow down training and can weaken the model’s ability to find reliable patterns.
Cold-Start Problems
- Users/products with very few interactions have too little data for the algorithm to learn meaningful preferences.
Model Stability
- Users who’ve only rated 1–2 items don’t contribute much and may even skew results.
Performance Boost
- Smaller, denser matrices train faster and often result in more accurate recommendations.
Strategy¶
We'll filter to keep:
- Users who have rated at least 50 products
- Products that have been rated by at least 50 users
These thresholds are commonly used to balance coverage and quality.
Filter for Active Users and Popular Products¶
To reduce matrix sparsity and improve recommendation quality, we keep only:
- Users who have rated at least 50 products
- Products that have been rated by at least 50 users
This results in a denser dataset with stronger interaction signals.
# Keep users who have rated at least 50 products
active_users = user_activity[user_activity >= 50].index
df_filtered = df_filtered[df_filtered['user_id'].isin(active_users)]
# Recalculate product counts after user filtering
product_popularity = df_filtered['product_id'].value_counts()
# Keep products that have been rated at least 50 times
popular_products = product_popularity[product_popularity >= 50].index
df_filtered = df_filtered[df_filtered['product_id'].isin(popular_products)]
# Check updated shape
print("Filtered dataset shape after activity thresholds:", df_filtered.shape)
# Optional: Preview
df_filtered.head()
Filtered dataset shape after activity thresholds: (2770, 4)
| | user_id | product_id | rating | timestamp |
|---|---|---|---|---|
| 483958 | ADLVFFE4VBT8 | B0002L5R78 | 5.0 | 1229212800 |
| 484344 | A3G5MOHY1U635N | B0002L5R78 | 5.0 | 1362009600 |
| 484352 | A19W47CXJJP1MI | B0002L5R78 | 5.0 | 1323129600 |
| 484410 | A12DQZKRKTNF5E | B0002L5R78 | 5.0 | 1325116800 |
| 485096 | A25UZ7MA72SMKM | B0002L5R78 | 5.0 | 1276732800 |
Observation¶
- After filtering for activity thresholds (≥ 50 ratings per user and per product), the dataset now contains 2,770 entries.
- This is a significant reduction from over 5 million interactions — but now:
- All users included have demonstrated consistent engagement.
- All products included have sufficient feedback for modeling.
- While this results in a smaller dataset, it improves data density, reduces noise, and allows for more reliable collaborative filtering.
This subset is especially useful for testing or prototyping matrix-based models like SVD, NMF, or KNN. In a real production setting, we would likely use less aggressive thresholds to retain more users and items.
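One caveat with the sequential filter above (users first, then products): dropping unpopular products can push some users back below their threshold. A sketch of iterating until both conditions hold at once; the thresholds and toy data here are illustrative, not the notebook's actual values:

```python
import pandas as pd

def filter_until_stable(df, min_user=2, min_item=2):
    """Alternate user/item threshold filters until the frame stops shrinking."""
    while True:
        before = len(df)
        u = df['user_id'].value_counts()
        df = df[df['user_id'].isin(u[u >= min_user].index)]
        p = df['product_id'].value_counts()
        df = df[df['product_id'].isin(p[p >= min_item].index)]
        if len(df) == before:  # fixed point reached: both conditions hold
            return df

toy = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u2', 'u2', 'u3'],
    'product_id': ['p1', 'p2', 'p1', 'p2', 'p3'],
})
stable = filter_until_stable(toy)
print(stable)
```

For this project's single-pass filter the residual effect is small, but an iterative version guarantees every surviving user and product meets its threshold.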
Create the User-Item Matrix¶
We pivot the filtered dataset into a user-item matrix, where:
- Rows represent users
- Columns represent products
- Cells contain the user's rating for that product
This matrix will be sparse — most users haven't rated most products — but it forms the basis for collaborative filtering techniques.
# Create a pivot table (user-item matrix)
user_item_matrix = df_filtered.pivot_table(
index='user_id',
columns='product_id',
values='rating'
)
# Fill NaNs with 0 (optional, only if using models that expect filled matrix)
# user_item_matrix = user_item_matrix.fillna(0)
# Display shape and a preview
print("User-Item Matrix shape:", user_item_matrix.shape)
user_item_matrix.head()
User-Item Matrix shape: (828, 38)
| product_id | B0002L5R78 | B000JMJWV2 | B000LRMS66 | B000N99BBC | B000QUUFRW | B000VX6XL6 | B0019EHU8G | B001E1Y5O6 | B001TH7GUU | B002R5AM7C | ... | B00829TIEK | B0082E9K7U | B00834SJNA | B00834SJSK | B0088CJT4U | B008DWCRQW | B009SYZ8OC | B00BOHNYTW | B00G4UQ6U8 | B00HFRWWAM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| A100UD67AHFODS | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A100WO06OQR8BQ | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | 5.0 | ... | NaN | NaN | NaN | NaN | 4.0 | 5.0 | NaN | NaN | NaN | NaN |
| A10AFVU66A79Y1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A10NMELR4KX0J6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A10O7THJ2O20AG | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 38 columns
Observation¶
- The user-item matrix has 828 users (rows) and 38 products (columns).
- Each cell contains a rating (either 4.0 or 5.0), or NaN if the user hasn't rated that product.
- The matrix is sparse — most users have only rated a few of the available products.
- This format is ideal for applying collaborative filtering methods such as SVD (Singular Value Decomposition) or KNN-based similarity models.
Next, we'll apply SVD to this matrix to generate personalized recommendations based on latent user and item features.
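The sparsity claim can be checked numerically: density is simply the share of cells that actually hold a rating. A small sketch on a toy matrix (the real one is 828 × 38):

```python
import numpy as np
import pandas as pd

# Toy user-item matrix with 2 ratings out of 6 possible cells.
matrix = pd.DataFrame(
    [[5.0, np.nan, np.nan],
     [np.nan, 4.0, np.nan]],
    index=['u1', 'u2'], columns=['p1', 'p2', 'p3'],
)

# Fraction of cells that are filled in.
density = matrix.notna().sum().sum() / matrix.size
print(f"Density: {density:.1%}")
```

The same two lines applied to `user_item_matrix` give the exact density of our filtered data.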
Apply Matrix Factorization using SVD¶
We apply Singular Value Decomposition (SVD) to the user-item matrix to uncover latent factors that explain user preferences. SVD factorizes the matrix into three components:
- U: User-feature matrix
- Σ (sigma): Diagonal matrix of singular values
- Vt: Transposed item-feature matrix
We then use these to reconstruct an approximation of the full user-item matrix with predicted ratings.
# Fill NaNs with 0 for SVD
matrix_filled = user_item_matrix.fillna(0)
# Convert to numpy array
R = matrix_filled.values
# Number of latent factors
k = 15
# Apply SVD
U, sigma, Vt = svds(R, k=k)
# Convert sigma (1D) into a diagonal matrix
sigma = np.diag(sigma)
# Reconstruct the ratings matrix
predicted_ratings = np.dot(np.dot(U, sigma), Vt)
# Convert back to DataFrame for easier handling
predicted_df = pd.DataFrame(predicted_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)
# Preview
predicted_df.head()
| product_id | B0002L5R78 | B000JMJWV2 | B000LRMS66 | B000N99BBC | B000QUUFRW | B000VX6XL6 | B0019EHU8G | B001E1Y5O6 | B001TH7GUU | B002R5AM7C | ... | B00829TIEK | B0082E9K7U | B00834SJNA | B00834SJSK | B0088CJT4U | B008DWCRQW | B009SYZ8OC | B00BOHNYTW | B00G4UQ6U8 | B00HFRWWAM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| A100UD67AHFODS | 0.197361 | 0.216765 | 0.427702 | 0.063351 | 0.363190 | 0.495266 | 0.018857 | 0.328507 | -0.245672 | -0.011941 | ... | 0.280657 | 1.814255 | -0.223170 | -0.466789 | -0.012076 | 0.181163 | 0.084884 | -0.225589 | 1.647371 | 0.144609 |
| A100WO06OQR8BQ | 0.772218 | 0.342182 | 0.195361 | 0.472087 | -0.826224 | 0.967543 | 0.399888 | 0.456739 | 0.499931 | 3.059293 | ... | 0.181453 | 0.850221 | 0.457965 | -0.024183 | 5.572918 | 3.233693 | 0.810372 | -0.100217 | 0.460520 | 0.641099 |
| A10AFVU66A79Y1 | -0.055168 | 0.251909 | 0.380853 | -0.248796 | 1.165782 | -0.017622 | 0.043914 | 0.567286 | 0.298237 | 0.443113 | ... | -0.160385 | 0.185604 | 0.002711 | 0.088393 | -0.090563 | -0.042144 | 0.219962 | 0.269924 | 0.026083 | 0.104036 |
| A10NMELR4KX0J6 | -0.283842 | -0.417997 | -0.253220 | 0.402361 | 0.339536 | -0.215334 | -0.307922 | 0.147431 | -0.003668 | -0.874813 | ... | -0.092343 | 0.358711 | -0.083676 | 0.427923 | 0.548679 | -0.029192 | -0.052559 | 0.413422 | 0.320292 | 0.109375 |
| A10O7THJ2O20AG | -0.076884 | 0.431264 | 0.733881 | 0.464186 | 2.779335 | 0.286990 | 1.177197 | 0.001412 | 0.573502 | -1.479514 | ... | 0.600708 | 0.395454 | -0.016110 | -0.624510 | 0.187761 | -0.670853 | 0.457595 | 0.580788 | 0.756217 | -0.174561 |
5 rows × 38 columns
Observation¶
- The SVD algorithm decomposed the user-item matrix into three components and reconstructed an approximation of the full matrix with predicted ratings.
- The resulting matrix has the same shape as the original (828 users × 38 products).
- Each cell now contains a predicted rating for how much a user would likely rate an unrated product, based on patterns learned from the data.
- These predicted ratings can be used to generate personalized top-N product recommendations for each user.
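A common refinement, which the notebook does not apply and is offered here only as an assumption-labeled sketch, is to subtract each user's mean rating before running SVD and add it back after reconstruction, so predictions land closer to the original rating scale. On a toy dense matrix:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy dense ratings matrix (random floats stand in for real ratings).
rng = np.random.default_rng(0)
R = rng.uniform(1.0, 5.0, size=(6, 5))

# Demean per user, factorize the residuals, then restore the means.
user_means = R.mean(axis=1, keepdims=True)
U, sigma, Vt = svds(R - user_means, k=2)
predicted = U @ np.diag(sigma) @ Vt + user_means
print(predicted.shape)
```

With sparse data this demeaning is usually done over observed entries only; the zero-fill used above treats missing ratings as true zeros, which mean-centering partially mitigates.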
Generate Top-N Recommendations¶
Using the predicted ratings from the SVD output, we select the top N products with the highest predicted scores for each user — excluding the ones they've already rated.
This gives us a personalized recommendation list for every user based on their inferred preferences.
def get_top_n_recommendations(predictions_df, original_df, user_id, n=5):
"""
Returns top-N product recommendations for a given user based on predicted ratings.
Parameters:
predictions_df (DataFrame): SVD-predicted ratings matrix
original_df (DataFrame): Original filtered user-item ratings
user_id (str): The user for whom to generate recommendations
n (int): Number of top recommendations to return
Returns:
DataFrame: Top-N recommended products with predicted ratings
"""
# Get products already rated by the user
rated_products = original_df[original_df['user_id'] == user_id]['product_id'].tolist()
# Get predicted ratings for the user, sort by highest predicted score
user_predictions = predictions_df.loc[user_id].drop(rated_products)
top_n = user_predictions.sort_values(ascending=False).head(n)
return top_n.reset_index().rename(columns={user_id: 'predicted_rating'})
# Example: Top 5 recommendations for one user
example_user = predicted_df.index[0]
get_top_n_recommendations(predicted_df, df_filtered, example_user, n=5)
| | product_id | predicted_rating |
|---|---|---|
| 0 | B00G4UQ6U8 | 1.647371 |
| 1 | B0079UAT0A | 1.169006 |
| 2 | B002V88HFE | 0.708071 |
| 3 | B0041Q38NU | 0.617315 |
| 4 | B000VX6XL6 | 0.495266 |
Observation¶
- We successfully generated Top-5 product recommendations for a sample user based on predicted ratings.
- The recommended products are ranked by predicted preference score — even though the user hasn't rated them before.
- Example recommendations:
  - `B00G4UQ6U8` with predicted rating ≈ 1.65
  - `B0079UAT0A` with predicted rating ≈ 1.17
- These scores are relative and don't need to match the original 1–5 rating scale exactly, as they come from the low-rank matrix approximation.
This final step enables us to deliver personalized product suggestions using learned latent features — a core capability in modern recommendation systems.
📘 Summary and Next Steps¶
✅ What We Accomplished¶
In this project, we built a simple yet effective product recommendation system using Amazon Electronics ratings data:
Data Preprocessing:
- Loaded a large dataset of ~7.8 million Amazon ratings.
- Filtered for strong positive feedback (ratings ≥ 4.0) to focus on user preferences.
- Further filtered for active users and popular products to reduce noise and sparsity.
Exploratory Analysis:
- Analyzed rating distribution, user behavior, and item popularity.
- Observed the long-tail nature of both users and products.
Collaborative Filtering via SVD:
- Created a user-item matrix from the filtered data.
- Applied Singular Value Decomposition (SVD) to learn latent features.
- Reconstructed a predicted ratings matrix.
- Generated Top-N recommendations for users based on these predictions.
🧠 Skills Demonstrated¶
- Data wrangling with Pandas
- Data visualization with Seaborn/Matplotlib
- Dimensionality reduction using SciPy SVD
- Recommendation system logic and matrix factorization
- Building reusable recommendation functions
🚀 Potential Next Steps¶
- Add content-based filtering using product metadata or titles.
- Combine multiple models into a hybrid recommender system.
- Deploy the recommendation engine as a web app using Flask or Streamlit.
- Add evaluation metrics (e.g., RMSE, Precision@K) with a train/test split.
- Scale up to the full dataset with recommender libraries such as `Surprise` (explicit ratings) or `LightFM` and `implicit` (implicit feedback).
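As a starting point for the evaluation step suggested above, Precision@K can be computed by holding out some known-liked items per user and checking how many appear in the top-K recommendations. A hedged sketch (the item IDs below are hypothetical):

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in set(relevant))
    return hits / k

# Hypothetical top-5 list for one user and their held-out liked items.
recommended = ['B001', 'B002', 'B003', 'B004', 'B005']
held_out = {'B002', 'B005', 'B009'}
print(precision_at_k(recommended, held_out, k=5))  # 2 hits out of 5 -> 0.4
```

Averaging this score over all test users, after a time-based or random train/test split, gives a single ranking-quality number to compare model variants.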